Abstract: New methods are presented for the machine recognition and learning of categories, patterns, and knowledge. A probabilistic machine learning algorithm is described that scales favorably to extremely large datasets, avoids local minima problems, and provides fast learning and recognition speeds. Templates may be created using an evolutionary algorithm described here, constructed with other machine learning methods, designed by a human expert, or synthesized using a combination of these methods. Each template has a prototype and matching function, which can help improve generalization. These methods have applications in bioinformatics, financial data mining, goal-based planners, handwriting recognition, machine vision, natural language processing / understanding, search engines, strategy (such as business and games), and voice recognition.

Keywords: evolution, machine learning, pattern recognition, polymorphous, template

1. Introduction 

New machine learning methods and knowledge representations have sought to provide superior technology applications. In this regard, Machine Learning with Templates pertains to recognizing categories and predictive modelling, which can be important components of advanced software technology.

Machine Learning with Templates was designed to use the advantages of bottom-up and top-down methods. Over the last few decades, some practitioners in AI, cognitive psychology, machine learning and neurobiology have observed that some cognitive tasks are better suited to bottom-up methods, while other tasks are better performed by top-down methods.

An example of a bottom-up representation is a feedforward neural network that uses a gradient descent learning algorithm, applied to handwritten digit recognition. Other types of tasks use top-down methods. For example, IBM's Deep Blue software program plays chess, and Hammond's CHEF program creates new cooking recipes by adapting old recipes [23].

2. Summary of Useful Properties 

Machine Learning with Templates has some useful properties.


1) Categorization is probabilistic and polymorphous.

2) Learning algorithm 5.1 is extremely fast. On a 5 GHz Intel Pentium 4 computer, it takes less than one minute to build templates that successfully recognize handwritten letters with an error rate below 0.5%. Some neural network training times for handwriting recognition are considerably longer. Further, the design of the neural network architecture may take human researchers many weeks.


3) Learning algorithm 5.1 does not use a greedy optimization algorithm such as gradient descent, so it avoids local minima problems. On extremely large datasets, locally greedy algorithms may not adequately train in a practical amount of time.


4) Each template has a prototype and matching function, which can help improve generalization. In some applications, the use of prototypical examples and templates designed by a clever human expert can substantially increase machine learning accuracy and speed.


5) Recognition algorithm 4.1 is extremely fast. It scales well on huge datasets and large numbers of categories because it exploits exponential elimination. As an example, consider the task of recognizing Chinese characters [25]. Some estimates state that there are 180 million distinct categories of Chinese characters. When the learning algorithm builds a set of templates that on average eliminate $\frac{1}{3}$ of the remaining Chinese character categories, then one trial of the recognition algorithm on average uses only 46 randomly chosen templates to reduce 180 million possible Chinese categories to a single category. (46 is the smallest natural number $n$ satisfying the inequality $180{,}000{,}000 \cdot (\frac{2}{3})^n < 2$.)


6) Machine Learning with Templates is flexible enough to create useful applications in bioinformatics, financial data mining, goal-based planners, handwriting recognition, information retrieval, machine vision, natural language processing / understanding and voice recognition.
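The template count quoted in property 5 can be checked with a few lines of arithmetic. The sketch below assumes that every template leaves a fixed fraction of the remaining categories in play (two thirds here, i.e. one third eliminated, which reproduces the figure of 46). The function name `templates_needed` and its parameters are illustrative and do not come from the paper.

```python
def templates_needed(num_categories: int, keep_fraction: float,
                     target: float = 2.0) -> int:
    """Smallest n such that num_categories * keep_fraction**n < target."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must lie strictly between 0 and 1")
    n = 0
    remaining = float(num_categories)
    # Apply one template at a time until fewer than `target` categories remain.
    while remaining >= target:
        remaining *= keep_fraction
        n += 1
    return n


# One third of the remaining categories eliminated per template.
print(templates_needed(180_000_000, 2 / 3))
# One half eliminated per template, for comparison.
print(templates_needed(180_000_000, 1 / 2))
```

Because the remaining count shrinks geometrically, the number of templates grows only logarithmically with the number of categories, which is the sense in which elimination is exponential.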


3. Definitions and Template Structure 

The space $C$ is called a category space. Sometimes a space is a mathematical set, but a space may have more structure. If the category space is about concepts that are living creatures, then typical members of $C$ are the categories: dog, cat, animal, mammal, lizard, starfish, tree, barley. A different category space is the letters of our alphabet $\{a, b, c, \ldots, y, z\}$. Another abstract type of category space may be the functional purpose of genes. One category is any gene that influences eye color; another category is any gene that codes for enzymes used in the liver; and another category is any gene that codes for membrane proteins used in neurons. In short, the structure of the category space depends upon the set of possibilities that you want the software to retrieve, recognize or categorize.

Fig. 1: Shapes with loops and no loops (figure label: "The shape has a loop in it.")

The example space $\mathcal{E}$ represents every conceivable example. Consider a software program that recognizes handwritten letters used in the English language. The example space is every handwritten a; ... ; and every handwritten z. The set of training examples $\{e_1, e_2, \ldots, e_m\}$ is a subset of the example space $\mathcal{E}$.

The function $G : \mathcal{E} \to \mathcal{P}(C)$ is an ideal map or ideal target function [20], where $\mathcal{P}(C)$ is the power set of $C$. In this case, each example $e$ in $\mathcal{E}$ can be classified as lying in 0, 1 or multiple categories. The goal is to construct a function $g : \mathcal{E} \to \mathcal{P}(C)$ so that $g(e) = G(e)$ for every $e \in \mathcal{E}$. The example $e$ = tiger lies in the categories cat, animal and mammal. In other words, $G(e) = \{\text{cat}, \text{animal}, \text{mammal}\}$. For some applications, it may be impossible or extremely difficult to explicitly describe $G$ with a mathematical formula or representation. In other implementations, the results of the template recognition algorithm can interpret $e$ as having a probability $p(c)$ of lying in category $c$ for each $c \in C$. In these cases, the goal is to build $g : \mathcal{E} \to [0,1]^C$ close to $G$.

Templates are similar to classifiers [20], but have additional structure. Templates are used to distinguish between two different categories of patterns, information or knowledge. For example, if the two different categories are the letters a and m, then "the shape has a loop in it" is a useful template because it distinguishes a from m. (See Figure 1.)

Distinct from classifiers, templates have prototype and matching functions. Let $\{T_1, T_2, \ldots, T_n\}$ denote a collection of templates, called the template set. Associated to each template, there is a corresponding template value function $T_i : \mathcal{E} \to V$, where $V$ is a template value space. There are no restrictions made on the structure of $V$, which may be a subset of the real line, a subset of the integers, a discrete set, a manifold, or even a function space.

Each template $T_i$ has a corresponding prototype function $T_i : C \to \mathcal{P}(V)$. The prototype function is constructed during the learning phase. If $c$ is a category in $C$, then $T_i(c)$ equals the set of prototypical template values that one expects for all examples $e$ that lie in category $c$. Intuitively, the prototype function represents how template $T_i$ generalizes to every example $e$.

For each (template, category) pair, denoted as $(T_i, c)$, there is a matching function $M_{(T_i,c)} : V \to S$, where $S$ is the similarity space. The matching function determines if the template value is similar enough to its set of prototypical values. In general, the similarity space $S$ is the range of the matching function, and can be the unit interval $[0,1]$, a subset of $\mathbb{R}^n$, a discrete set, a manifold or a function space.


Example 3.1: Boolean Matching Function

Let $V = \{0,1\}$. Choose similarity space $S = \{0,1\}$. If template value $v_1$ is similar to its prototypical set $T_i(c)$, then $M_{(T_i,c)}(v_1) = 1$. If $v_1$ is not similar enough to its prototypical set $T_i(c)$, then $M_{(T_i,c)}(v_1) = 0$. Choose two categories $\{c_1, c_2\}$ and two templates $\{T_1, T_2\}$. Define prototypical functions for $T_1$ and $T_2$ as $T_1(c_1) = \{1\}$ and $T_1(c_2) = \{0,1\}$, $T_2(c_1) = \{0,1\}$ and $T_2(c_2) = \{0\}$. There are four distinct matching functions $M_{(T_1,c_1)}$, $M_{(T_2,c_1)}$, $M_{(T_1,c_2)}$ and $M_{(T_2,c_2)}$.

1) $M_{(T_1,c_1)}(0) = 0$ AND $M_{(T_1,c_1)}(1) = 1$

2) $M_{(T_2,c_1)}(0) = 1$ AND $M_{(T_2,c_1)}(1) = 1$

3) $M_{(T_1,c_2)}(0) = 1$ AND $M_{(T_1,c_2)}(1) = 1$

4) $M_{(T_2,c_2)}(0) = 1$ AND $M_{(T_2,c_2)}(1) = 0$
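The four matching functions of Example 3.1 follow mechanically from the prototype sets. The sketch below is a minimal illustration under that reading: the dictionary `PROTOTYPES` and the function `match` are hypothetical names, and the matching rule is the boolean one (set membership) used in the example.

```python
# Prototype sets from Example 3.1, keyed by (template, category).
PROTOTYPES = {
    ("T1", "c1"): {1},
    ("T1", "c2"): {0, 1},
    ("T2", "c1"): {0, 1},
    ("T2", "c2"): {0},
}


def match(template: str, category: str, value: int) -> int:
    """Boolean matching: 1 if the value lies in the prototype set, else 0."""
    return 1 if value in PROTOTYPES[(template, category)] else 0


# Tabulate each matching function on the two possible template values.
for template, category in PROTOTYPES:
    row = [match(template, category, v) for v in (0, 1)]
    print(template, category, row)
```

A value of 0 from `match` is what lets the recognition algorithm discard a category: for instance, a template value of 0 under T1 rules out c1 but not c2.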


4. Template Recognition 

The template recognition algorithm categorizes an example $e$ from $\mathcal{E}$. When finished, for each category $c$ in $C$, there is a corresponding category score $s_c$, which measures to what extent the algorithm believes example $e$ is in category $c$.

Algorithm 4.1: Template Recognition Algorithm

Allocate memory. Read learned templates $\{T_1, \ldots, T_n\}$ from long-term memory.
Initialize every category score $s_c$ to zero.
Outer loop: $m$ trials.
{
    Initialize set $R$ equal to $C$.
    Inner loop: choose $\rho$ templates randomly.
    {
        Choose template $T_k$ with probability $p_k$.
        For each category $c \in R$
            if $M_{(T_k,c)}(T_k(e)) = 0$, then set $R := R - \{c\}$.
            (Remove category $c$ from $R$.)
    }
    For each category $c$ remaining in $R$, category score $s_c := s_c + 1$.
}
Initialize $A$ to the empty set.
For each category $c$ in $C$, if $(s_c / m) > \theta$, $A := A \cup \{c\}$.
The answer is $A$. The example $e$ is in the categories that are in $A$.

Comments on the Template Recognition Algorithm.

1) Each template $T_k$ has a corresponding probability $p_k$ of being chosen during the inner loop. $m$ is the number of trials in the outer loop, and $\rho$ is the number of templates randomly chosen in each trial. $\theta$ is a number in the interval $[0,1]$ that is the acceptable category threshold.

2) In some applications, the category threshold is not used. Each value $s_c / m$ is interpreted as the probability that example $e$ lies in category $c$.

3) If only the best category is returned as an answer, rather than multiple categories, then replace the step "For each category $c$ in $C$, if $(s_c / m) > \theta$, $A := A \cup \{c\}$" with a search for the maximum category score, provided there are a finite number of categories. The answer is the category or categories that have the maximum category score.

4) If template value space $V$ and similarity space $S$ have more structure, then the test "if $M_{(T_k,c)}(T_k(e)) = 0$" inside the inner loop may be replaced by the test "if $T_k(e)$ is not close to prototype value $T_k(c)$". In this case, matching function $M_{(T_k,c)}$ measures the closeness of $T_k(e)$ and $T_k(c)$.

5) In some applications, the inner loop may be exited if $R$ has only one element left (i.e., $|R| = 1$) before all $\rho$ templates have been applied.
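The recognition loop can be sketched in Python for the boolean case. This is an illustration under stated assumptions, not the author's implementation: templates are plain callables, each prototype function is a dictionary from category to a set of values, matching is set membership, and the names `recognize`, `per_trial` and the toy data are all hypothetical. The optional early exit of comment 5 is left out.

```python
import random
from typing import Callable, Dict, Hashable, Sequence, Set


def recognize(
    example,
    templates: Sequence[Callable],
    prototypes: Sequence[Dict[Hashable, Set]],
    categories: Sequence[Hashable],
    probabilities: Sequence[float],
    trials: int,
    per_trial: int,
    threshold: float,
    rng: random.Random,
) -> Set[Hashable]:
    """Return every category whose score fraction exceeds the threshold."""
    scores = {c: 0 for c in categories}
    template_ids = range(len(templates))
    for _ in range(trials):                  # outer loop: one pass per trial
        remaining = set(categories)          # R starts as the whole category space
        for _ in range(per_trial):           # inner loop: randomly drawn templates
            k = rng.choices(template_ids, weights=probabilities)[0]
            value = templates[k](example)
            # Boolean matching: keep c only if the value is in its prototype set.
            remaining = {c for c in remaining if value in prototypes[k][c]}
        for c in remaining:
            scores[c] += 1
    return {c for c in categories if scores[c] / trials > threshold}


# Toy data (hypothetical): integers categorized by parity and size.
templates = [lambda x: x % 2, lambda x: int(x >= 10)]
categories = ["small even", "small odd", "large even", "large odd"]
prototypes = [
    {"small even": {0}, "small odd": {1}, "large even": {0}, "large odd": {1}},
    {"small even": {0}, "small odd": {0}, "large even": {1}, "large odd": {1}},
]
answer = recognize(12, templates, prototypes, categories, [0.5, 0.5],
                   trials=200, per_trial=8, threshold=0.5,
                   rng=random.Random(0))
print(answer)
```

The correct category is never eliminated, so its score fraction is 1. A wrong category survives a trial only when every template that separates it from the example happens not to be drawn, which becomes rarer as the number of templates per trial grows.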


5. Template Learning 

The initial part of the learning phase constructs the templates from simple building blocks, using the examples and the categories to guide the construction. Templates can be built with evolution, by another machine learning method, by a human expert with domain expertise or by a combination of these methods. In the next section, evolution algorithm 6.1 builds template value functions $T_i : \mathcal{E} \to V$ from a collection of building block functions $\{f_1, f_2, \ldots, f_r\}$. The rest of the learning builds the matching functions $M_{(T_i,c)}$, constructs the prototype functions $T_i : C \to \mathcal{P}(V)$, computes the probabilities $p_i$, and in some cases sets the category threshold $\theta$.

The learning starts with a collection of training examples along with their categories. Depending on the type of template values, there are different methods for constructing the prototype and matching functions $M_{(T_i,c)}$. For clarity, the template values used are boolean (i.e., $V = \{0,1\}$). In the boolean case, $M_{(T_i,c)}(v) = 1$ if $v$ lies in the prototypical set $T_i(c)$; $M_{(T_i,c)}(v) = 0$ if $v$ does not lie in the prototypical set $T_i(c)$. In a more general description of the algorithm, the template value space $V$ may be the interval $[0,1]$, the circle $S^1$, or another manifold, a function space, a space of algorithms or even a measurable space (e.g., [22]). When $V$ is a metric space [21], the matching function $M_{(T_k,c)}$ may use $V$'s metric to measure the closeness of $T_k(e)$ and $T_k(c)$ as described in recognition comment 4.

Algorithm 5.1: Template Learning Algorithm

Allocate memory for the templates.
Read from memory template set $\{T_1, T_2, \ldots, T_n\}$.
(The templates read are user-created, created by evolution or by an alternative method.)
Outer loop: iterate through each template $T_k$.
{
    Initialize $A' := T_k(c_1)$.
    Initialize $A := A'$.
    Inner loop: iterate through each category $c_i$.
    {
        Set $E_{c_i}$ := all learning examples in $c_i$.
        Build prototype function $T_k$ as follows:
        Set $T_k(c_i) := \bigcup \{v\}$ for each $v = T_k(e)$ and $e \in E_{c_i}$.
        Set $A := A \cup T_k(c_i)$.
        Build matching function $M_{(T_k,c_i)}$.
        (See above for boolean case.)
    }
    If ($A = A'$) remove $T_k$ from the template set.
}
Store the remaining templates.
Set each probability $p_k = \frac{1}{m}$, where $m$ is the number of remaining templates.

Comments on the Template Learning Algorithm.

1) The inner loop assumes that there are a finite number of categories.

2) In some cases, instead of a category threshold, each score $s_c$ is interpreted as the probability that $e$ lies in category $c$. In other cases, the category threshold $\theta$ is empirically determined.

3) For a fixed template $T_k$, there should be at least one pair of categories $(c_i, c_j)$ such that $T_k(c_i) \neq T_k(c_j)$. Otherwise, template $T_k$ cannot separate any categories, so $T_k$ should be removed.

4) In some applications, non-uniform probabilities $p_k$ can be selected based on template $T_k$'s ability to separate categories, $T_k$'s computing speed or another property.
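The learning phase can be sketched in Python for the boolean case. The sketch makes several assumptions that are not taken from the paper: training data arrive as (example, category) pairs, prototype functions are stored as dictionaries of sets, a template is discarded using the criterion of comment 3 (all of its prototype sets coincide), and the names `learn_templates`, `parity`, `is_large` and `constant` are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple


def learn_templates(
    templates: Sequence[Callable],
    training: Sequence[Tuple[object, Hashable]],
) -> Tuple[List[Callable], List[Dict[Hashable, Set]], List[float]]:
    """Build prototype sets, drop useless templates, assign uniform probabilities."""
    # Categories in first-seen order, without duplicates.
    categories = list(dict.fromkeys(category for _, category in training))
    kept: List[Callable] = []
    prototypes: List[Dict[Hashable, Set]] = []
    for template in templates:
        # Prototype set of a category: all template values seen on its examples.
        proto: Dict[Hashable, Set] = {c: set() for c in categories}
        for example, category in training:
            proto[category].add(template(example))
        # Comment 3: keep the template only if two categories differ.
        if len({frozenset(values) for values in proto.values()}) > 1:
            kept.append(template)
            prototypes.append(proto)
    probabilities = [1.0 / len(kept) for _ in kept]
    return kept, prototypes, probabilities


# Toy data (hypothetical): integers categorized by parity and size.
parity = lambda x: x % 2
is_large = lambda x: int(x >= 10)
constant = lambda x: 0
training = [(2, "small even"), (3, "small odd"),
            (12, "large even"), (15, "large odd")]
kept, prototypes, probabilities = learn_templates(
    [parity, is_large, constant], training)
print(len(kept), probabilities)
```

In this representation the boolean matching function needs no separate storage: testing whether a value belongs to the stored prototype set is the matching function.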

6. Designing Templates with Evolution 

Evolutionary methods for optimizing processes and algorithms were introduced, and then further developed, in a series of earlier works. Building upon this prior work, this section presents an evolutionary method to design the template value functions $T_i : \mathcal{E} \to V$.

Building blocks are composed to build a useful element. In some cases, the building blocks are a collection of functions $f_\lambda : X \to X$, where $\lambda \in \Lambda$, $X$ is a set and $\Lambda$ is an index set. In some cases, $X$ is the set of computable real numbers. In a handwriting recognition application, $X$ is the rational numbers and the binary functions $f_1 = +$, $f_2 = -$, $f_3 = *$, and $f_4 = /$ are sufficient for the building blocks. The index set, $\Lambda = \{1, 2, 3, 4\}$, has four elements. In some cases, the index set may be infinite. For example, consider the functions $f_{(k,b)} : \mathbb{Z} \to \mathbb{Z}$ such that $f_{(k,b)}(x) = p_k x + b$, where $b, k \in \mathbb{N}$ and $p_k$ is the $k$th prime (i.e., $p_1 = 2$, $p_2 = 3$, ...).
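The infinite family of building blocks indexed by a pair (k, b) can be generated on demand rather than stored. A minimal sketch follows; the helper names `kth_prime` and `building_block` are hypothetical, and trial division is used only because it is short, not because the paper prescribes it.

```python
import math
from typing import Callable


def kth_prime(k: int) -> int:
    """Return the k-th prime, counting from p_1 = 2 (simple trial division)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    count, candidate = 0, 1
    while count < k:
        candidate += 1
        # candidate is prime if no d in [2, sqrt(candidate)] divides it.
        if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            count += 1
    return candidate


def building_block(k: int, b: int) -> Callable[[int], int]:
    """Return the function x -> p_k * x + b on the integers."""
    p = kth_prime(k)
    return lambda x: p * x + b


f = building_block(3, 1)   # p_3 = 5, so f(x) = 5x + 1
print(f(4))
```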

Bit-sequences $[b_1 b_2 b_3 \ldots b_n]$, where $b_k \in \{0,1\}$, encode functions composed from building block functions. $[10110]$ is a bit-sequence of length 5. The expression $\{f_1, f_2, f_3, \ldots, f_r\}$ denotes the building block functions, where $r = 2^K$ for some $K$. Then $K$ bits uniquely represent one building block function. The arity of a function is the number of arguments that it requires. For example, the arity of the real-valued quadratic function $f(x) = x^2$ is 1. The arity of the projection function $P_i : X^n \to X$ is $n$, where $P_i(x_1, x_2, \ldots, x_n) = x_i$.

Define the function $\vee$ as $\vee(x,y,z) = x$ if ($x \geq y$ AND $x \geq z$), else $\vee(x,y,z) = y$ if ($y \geq x$ AND $y \geq z$), else $\vee(x,y,z) = z$. Consider the functions $\{+, -, *, \vee\}$. Each sequence of two bits uniquely corresponds to one of these functions: $00 \leftrightarrow +$, $01 \leftrightarrow -$, $10 \leftrightarrow *$, $11 \leftrightarrow \vee$.

Bit-sequence $[00, 01]$ encodes the function $+(-(x_1, x_2), x_3) = (x_1 - x_2) + x_3$, which has arity 3. For the general case, consider the building block functions $\{f_1, f_2, f_3, \ldots, f_r\}$, where $r = 2^K$. Any bit-sequence $[b_1 b_2 \ldots b_K \; b_{K+1} b_{K+2} \ldots b_{2K} \; \ldots \; b_{aK+1} b_{aK+2} \ldots b_{(a+1)K}]$ with length $(a+1)K$ is a composition of the building block functions $\{f_1, f_2, f_3, \ldots, f_r\}$. The composition of these $r$ building block functions is encoded in a similar way, as described for the functions $\{+, -, *, \vee\}$.
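A decoder for such bit-sequences can be sketched as follows. The text fixes only the two-block example, so the general nesting rule used here is an assumption: the first code is the outermost function, each later code is nested in the first argument of the one before it, and the remaining arguments are consumed left to right. The names `BLOCKS`, `vee` and `decode` are hypothetical.

```python
from typing import Callable, Tuple


def vee(x, y, z):
    """Largest of three values (the ternary building block)."""
    return max(x, y, z)


K = 2  # bits per building block, so r = 2**K = 4 blocks
# Two-bit codes for the building blocks: (function, arity).
BLOCKS = {
    "00": (lambda x, y: x + y, 2),
    "01": (lambda x, y: x - y, 2),
    "10": (lambda x, y: x * y, 2),
    "11": (vee, 3),
}


def decode(bits: str) -> Tuple[Callable, int]:
    """Return (composed function, arity) for a bit string of length a multiple of K."""
    if not bits or len(bits) % K != 0:
        raise ValueError("bit string length must be a positive multiple of K")
    codes = [bits[i:i + K] for i in range(0, len(bits), K)]
    # Each block adds (its arity - 1) fresh arguments; the innermost adds one more.
    arity = 1 + sum(BLOCKS[code][1] - 1 for code in codes)

    def composed(*args):
        if len(args) != arity:
            raise TypeError(f"expected {arity} arguments, got {len(args)}")
        # Evaluate the innermost block (last code) on the leading arguments.
        func, n = BLOCKS[codes[-1]]
        result = func(*args[:n])
        position = n
        # Work outward: the running result becomes the first argument.
        for code in reversed(codes[:-1]):
            func, n = BLOCKS[code]
            result = func(result, *args[position:position + n - 1])
            position += n - 1
        return result

    return composed, arity


f, arity = decode("0001")      # +(-(x1, x2), x3)
print(arity, f(5, 2, 10))
```

Under the same assumed rule, the string `1100` decodes to a function of four arguments that takes the largest of `x1 + x2`, `x3` and `x4`.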

The distinct categories are $\{C_1, C_2, \ldots, C_N\}$. The population size of each generation is $m$. For each $i$, where $1 \leq i \leq N$, $E_{C_i}$ is the set of all learning examples that lie in category $C_i$. The symbol $\gamma$ is an acceptable level of performance for a template. The symbol $Q$ is the number of distinct templates whose fitness must be greater than $\gamma$. The symbol $p_{\mathrm{crossover}}$ is the probability that two templates chosen for the next generation will be crossed over. The symbol $p_{\mathrm{mutation}}$ is the probability that a template will be mutated.

The main evolution steps are summarized as follows. For each category pair $(C_i, C_j)$, $i < j$, the building blocks $\{f_1, f_2, f_3, \ldots, f_r\}$ are used to build a population of $m$ templates. This is accomplished by choosing $m$ multiples of $K$, $\{l_1, l_2, \ldots, l_m\}$. For each $l_i$, a bit sequence of length $l_i$ is constructed. These $m$ bit sequences represent the $m$ templates $\{T_1^{(i,j)}, T_2^{(i,j)}, T_3^{(i,j)}, \ldots, T_m^{(i,j)}\}$. The superscript $(i,j)$ indicates that these templates are evolved to distinguish examples chosen from $E_{C_i}$ and $E_{C_j}$. The fitness of each template is determined by how well the template can distinguish examples chosen from $E_{C_i}$ and $E_{C_j}$. Using crossover and mutation, the population of bit-sequences is evolved until there are at least $Q$ templates which have a fitness greater than $\gamma$. When this happens, choose the $Q$ best templates from the population that distinguish categories $C_i$ and $C_j$. Store these $Q$ best templates in a distinct set $\mathcal{T}$ of templates that are used in the template learning algorithm.
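One way to quantify how well a template distinguishes two example sets, when the template values carry a metric, is the mean distance between template values taken across the two sets. The sketch below assumes exactly that measure, with the normalization by the number of pairs being an assumption. The function name `separation` and the default absolute-difference metric are also illustrative, and the memory and running-time criteria that enter the full fitness are not modeled.

```python
from itertools import product
from typing import Callable, Sequence


def separation(
    template: Callable,
    examples_i: Sequence,
    examples_j: Sequence,
    distance: Callable = lambda x, y: abs(x - y),
) -> float:
    """Mean distance between template values across two example sets."""
    if not examples_i or not examples_j:
        raise ValueError("both example sets must be non-empty")
    # Sum the distance over every cross pair (e_i, e_j).
    total = sum(
        distance(template(e_i), template(e_j))
        for e_i, e_j in product(examples_i, examples_j)
    )
    return total / (len(examples_i) * len(examples_j))


evens, odds = [2, 4, 6], [1, 3]
print(separation(lambda x: x % 2, evens, odds))   # parity separates the sets
print(separation(lambda x: 0, evens, odds))       # a constant template does not
```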

Algorithm 6.1: Building Templates with Evolution

Set $\mathcal{T}$ equal to the empty set.
For each $i$ in $\{1, 2, 3, \ldots, N\}$
For each $j$ in $\{i+1, i+2, \ldots, N\}$
{
    Initialize population $A_{(i,j)} = \{T_1^{(i,j)}, T_2^{(i,j)}, \ldots, T_m^{(i,j)}\}$.
    Set $q := 0$.
    while ($q < Q$)
    {
        Set $G := \emptyset$.
        while ($|G| < m$)
        {
            For the next generation, randomly choose two templates $T_a^{(i,j)}$ and $T_b^{(i,j)}$ from $A_{(i,j)}$, where the probability is proportional to the template's fitness.
            Randomly choose a number $r$ in $[0,1]$.
            If ($r < p_{\mathrm{crossover}}$), then crossover templates $T_a^{(i,j)}$ and $T_b^{(i,j)}$.
            Randomly choose numbers $s_a, s_b$ in $[0,1]$.
            If ($s_a < p_{\mathrm{mutation}}$), mutate template $T_a^{(i,j)}$.
            If ($s_b < p_{\mathrm{mutation}}$), mutate template $T_b^{(i,j)}$.
            Set $G := G \cup \{T_a^{(i,j)}, T_b^{(i,j)}\}$.
        }
        Set $A_{(i,j)} := G$.
        For each template $T_a^{(i,j)}$ in $A_{(i,j)}$, evaluate $T_a^{(i,j)}$'s ability to distinguish examples from categories $C_i$ and $C_j$.
        Store this ability as the fitness of $T_a^{(i,j)}$.
        Set $q$ equal to the number of templates with fitness greater than $\gamma$.
    }
    Based on fitness, choose the $Q$ best templates from $A_{(i,j)}$ and add them to $\mathcal{T}$.
}
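The inner `while` loop of Algorithm 6.1 combines three operators: fitness-proportional selection, crossover at two cut points, and single-bit mutation. The sketch below illustrates one generation on bit strings under several assumptions that are not from the paper: bit-sequences are Python strings of `0` and `1`, the next generation is kept as a list (so duplicates survive, unlike the set $G$), and all function names are hypothetical. The cut-point condition follows the comments on this algorithm: the tails are swapped only at positions that keep both lengths a multiple of $K$.

```python
import random
from typing import List, Sequence, Tuple


def select(population: Sequence[str], fitnesses: Sequence[float],
           rng: random.Random) -> str:
    """Pick one bit string with probability proportional to its fitness."""
    # random.choices raises ValueError if every fitness is zero.
    return rng.choices(population, weights=fitnesses)[0]


def crossover(a: str, b: str, k: int, rng: random.Random) -> Tuple[str, str]:
    """Swap tails at cut points u, w chosen so both lengths stay multiples of k."""
    pairs = [(u, w)
             for u in range(1, len(a) + 1)
             for w in range(1, len(b) + 1)
             if (u + len(b) - w) % k == 0]
    u, w = rng.choice(pairs)
    return a[:u] + b[w:], b[:w] + a[u:]


def mutate(bits: str, rng: random.Random) -> str:
    """Flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i + 1:]


def next_generation(population: Sequence[str], fitnesses: Sequence[float],
                    k: int, p_crossover: float, p_mutation: float,
                    rng: random.Random) -> List[str]:
    """Build one new generation of the same size as the current population."""
    new: List[str] = []
    while len(new) < len(population):
        a = select(population, fitnesses, rng)
        b = select(population, fitnesses, rng)
        if rng.random() < p_crossover:
            a, b = crossover(a, b, k, rng)
        if rng.random() < p_mutation:
            a = mutate(a, rng)
        if rng.random() < p_mutation:
            b = mutate(b, rng)
        new.extend([a, b])
    return new[:len(population)]


rng = random.Random(0)
population = ["0001", "101100", "11", "0110"]
generation = next_generation(population, [1.0, 2.0, 3.0, 4.0], k=2,
                             p_crossover=0.5, p_mutation=0.05, rng=rng)
print(generation)
```

Since both parents have lengths that are multiples of $K$, requiring one child's length to be a multiple of $K$ forces the same for the other, because the two child lengths always sum to the two parent lengths.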

Comments on Building Templates with Evolution.

1) The fitness $\phi_a$ of template $T_a^{(i,j)}$ is computed by a weighted average of three criteria.

a) The ability of a template to distinguish examples in $E_{C_i}$ from examples in $E_{C_j}$.

b) The amount of memory used by the template.

c) The average amount of time to compute the template value function on $E_{C_i}$ and $E_{C_j}$.

A quantitative measure for criterion (a) depends on the topology of the template value space $V$. If $V = \{0,1\}$, the ability of template $T_a^{(i,j)}$ to distinguish examples in $E_{C_i}$ from examples in $E_{C_j}$ equals

[equation not recoverable from the source]

When $V \neq \{0,1\}$, then $V$ has a metric $D$, which measures the distance between two points in the template value space $V$. In this case, the ability of template $T_a^{(i,j)}$ to distinguish examples in $E_{C_i}$ from examples in $E_{C_j}$ equals

$$\frac{1}{|E_{C_i}|\,|E_{C_j}|} \sum_{e_j \in E_{C_j}} \sum_{e_i \in E_{C_i}} D\big(T_a^{(i,j)}(e_i),\, T_a^{(i,j)}(e_j)\big).$$

2) Figure 2 shows a crossover between bit-sequences $A = [a_1 a_2 a_3 \ldots a_L]$ and $B = [b_1 b_2 b_3 \ldots b_M]$. $L$ and $M$ are each multiples of $K$, where $r = 2^K$ and the functions are $\{f_1, f_2, f_3, \ldots, f_r\}$. In general, $L \neq M$. Two natural numbers $u$ and $w$ are randomly chosen such that $1 \leq u \leq L$, $1 \leq w \leq M$ and $u + M - w$ is a multiple of $K$. The multiple of $K$ condition assures that after crossover, the length of each bit sequence is a multiple of $K$. Each bit sequence after crossover is interpreted as a composition of functions $\{f_1, f_2, f_3, \ldots, f_r\}$. The numbers $u$ and $w$ identify the crossover locations on $A$ and $B$, respectively. After crossover, bit-sequence $A$ is $[a_1 a_2 a_3 \ldots a_u b_{w+1} b_{w+2} \ldots b_M]$, and bit-sequence $B$ is $[b_1 b_2 b_3 \ldots b_w a_{u+1} a_{u+2} \ldots a_L]$.

3) Before mutation, the bit-sequence is $[b_1 b_2 b_3 \ldots b_n]$. A mutation randomly selects $k$ and assigns $b_k$ the value $1 - b_k$.

4) In the current generation, the collection $\{\phi_1, \phi_2, \ldots, \phi_m\}$ represents the fitnesses of the templates $\{T_1^{(i,j)}, T_2^{(i,j)}, \ldots, T_m^{(i,j)}\}$. The probability that template $T_a^{(i,j)}$ is chosen for the next generation is

$$\frac{\phi_a}{\sum_{k=1}^{m} \phi_k}.$$

5) $p_{\mathrm{crossover}}$ usually ranges from 0.3 to 0.7.

6) $p_{\mathrm{mutation}}$ is usually less than 0.1.

Fig. 2: Unbounded Crossover (after crossover, Bit Sequence A is $a_1, a_2, a_3, \ldots, a_u, b_{w+1}, \ldots$ and Bit Sequence B is $b_1, b_2, b_3, \ldots, b_w, a_{u+1}, \ldots, a_L$)

7. UCI Machine Learning Tests

Testing against the UCI Machine Learning Repository http://archive.ics.uci.edu/ml/ is in progress. A subsequent publication will cover these results.